(57) Abstract:

PURPOSE: To reduce the amount of computation while maintaining
recognition performance by first narrowing down the word candidates
with a collation operation on thinned-out data, and then collating the
narrowed-down candidates using data that are not thinned out.
CONSTITUTION: For the word candidates and the collation section, a 2nd
collation/decision part 12 collates the parameters stored in a parameter
storage part 10 against word standard patterns, with the end points fixed
and without thinning out, and outputs the word candidate that gives the
best collation result as the recognition result. Because the collation
section supplied to the 2nd collation/decision part 12 is made longer than
the actual speech section, the word standard patterns used for this
collation are first processed by a 2nd word standard pattern processing
part 13, which connects an environment standard pattern stored in an
environment standard pattern storage part 14 to both ends of the word
standard pattern obtained by a word standard pattern generation part 7.
CLAIMS



[Claim(s)] 

[Claim 1] A voice recognition unit comprising: a voice-analysis means which
analyzes an input sound signal for every frame, the frame being the base unit
of analysis, and extracts an analysis parameter; a frame clock generation
means which generates the timing signal of the analysis frame; a dividing
means which divides the above-mentioned frame clock by a predetermined
division ratio to obtain a divided clock signal; a word dictionary in which
each word is written as a sequence of notations representing voice pieces; a
voice piece standard-pattern storing means which stores voice piece standard
patterns, each constituted by a sequence of partial standard patterns
expressing parts of voice piece data created from voice data uttered
beforehand by many speakers; a word standard-pattern generation means which
obtains the standard pattern of a word by connecting the above-mentioned
voice piece standard patterns according to the notation in the
above-mentioned word dictionary; a 1st word standard-pattern processing means
which creates a thinned-out word standard pattern by thinning out a part of
the feature-parameter sequence constituting the above-mentioned word standard
pattern; a 1st collating/judgment means which, each time it receives the
above-mentioned divided clock signal, computes the partial distance between
the feature parameter obtained from the above-mentioned analysis parameter at
that time point and each partial standard pattern of the above-mentioned
thinned-out word standard pattern, accumulates it with the partial distances
already obtained between the thinned-out word standard pattern and the
feature-parameter sequence preceding that time point, thereby obtains the
minimum distance of the input to the thinned-out word standard pattern on the
assumption that the time point concerned is the end of a word, together with
the start-point position that accompanies it, and updates the pair of minimum
distance and start-point position for every word at every divided clock; a
candidate word selection means which, at the end of the input voice, compares
the distances to the standard patterns of all the words to be recognized with
one another and obtains a predetermined number of candidate words in
ascending order of distance; an endpoint positioning means which determines,
from the group of start-point and end-point candidates accompanying the
candidate words chosen by the above-mentioned candidate word selection means,
a section that is certain to include the voice section; a parameter storage
means which stores the above-mentioned analysis parameters over the whole
input section; an environmental pattern storing means which stores an
environmental standard pattern created beforehand from acoustic signals of
sections other than voice; a 2nd word standard-pattern processing means which
connects the above-mentioned environmental pattern before and after the
above-mentioned word standard pattern and creates a word standard pattern
with environmental patterns; and a 2nd collating/judgment means which, for
each word candidate chosen by the above-mentioned candidate word selection
means, computes by accumulating partial distances the distance between the
corresponding word standard pattern with environmental patterns and the
parameter sequence stored in the above-mentioned parameter storage means for
the section determined by the above-mentioned endpoint positioning means,
compares the distances obtained for the candidate words with one another, and
outputs the word candidate with the smallest distance as the recognition
result.

[Claim 2] The voice recognition unit according to claim 1, characterized in
that frame thinning is used in the processing of the 1st collating/judgment
means to simplify the calculation of partial distances and the word collating.

[Claim 3] The voice recognition unit according to claim 1, characterized in
that a 1st word standard-pattern merge means is added, which creates a
thinned-out merged word standard pattern by grouping multiple frames and
merging into one the partial standard patterns, within each group, of the
thinned-out word standard pattern created by the 1st word standard-pattern
processing means; and in that the 1st collating/judgment means has: a partial
distance calculation section which, each time it receives the divided clock
signal, computes the partial distance between the feature parameter obtained
from the analysis parameter at that time point and each partial standard
pattern of the above-mentioned thinned-out merged word standard pattern; a
representative partial distance selection section which compares said partial
distance with the partial distance at the time point preceding the one
concerned and takes the smaller of the two as the representative partial
distance; a distance accumulation section which accumulates the
representative partial distance with those already obtained between the
thinned-out word standard pattern and the feature-parameter sequence
preceding that time point; and a judgment section which obtains the minimum
distance of the input to the thinned-out word standard pattern on the
assumption that the time point concerned is the end of a word, together with
the start-point position that accompanies it, and updates the pair of minimum
distance and start-point position for every word at every divided clock;
thereby simplifying the calculation of partial distances and the word
collating.

[Claim 4] The voice recognition unit according to any one of claims 1 to 3,
characterized in that the partial distance is computed using a statistical
distance measure, and the above-mentioned statistical distance measure is a
distance measure based on a-posteriori probability.

[Claim 5] The voice recognition unit according to any one of claims 1 to 3,
characterized in that the partial distance is computed using a statistical
distance measure, and the above-mentioned statistical distance measure is a
linear discriminant based on a-posteriori probability.



DETAILED DESCRIPTION 



[Detailed Description of the Invention] 
[0001] 

[Industrial Application] This invention relates to a speech recognition
method by which a machine recognizes the human voice.

[0002] 

[Description of the Prior Art] Speech recognition approaches are divided into
those for a specified speaker and those for unspecified speakers; this
invention targets unspecified-speaker recognition in particular. As an example
of an approach for unspecified speakers, the example based on Japanese Patent
Application No. 3-314248 is explained, referring to drawing 9.
[0003] In drawing 7, 61 is the acoustic analysis section, 62 the
feature-parameter extraction section, 63 the voice section detection section,
64 the multiple-frame buffer, 65 the voice piece standard-pattern storing
section, 66 the word dictionary in which all the words to be recognized are
described as sequences of voice pieces, 67 the word standard-pattern
generation section, which generates the word standard patterns of the
recognition vocabulary by selecting and connecting voice piece standard
patterns according to the sequence of voice pieces, 68 the partial distance
calculation section, which finds the partial distance between the input vector
formed from multiple frames and each partial pattern of the voice to be
recognized using a statistical distance measure based on a-posteriori
probability, 69 the distance accumulation section, which finds the distance
between the input voice and a word standard pattern by accumulating the
partial distances over the whole voice while shifting the input frame, 610 the
path judging section, and 611 the judgment section, which takes the name of
the word with the minimum accumulated distance as the recognition result.

[0004] The acoustic analysis section 61 performs A/D conversion of the input
signal and analyzes it at fixed time intervals (called frames; every 10 ms in
this conventional example). The feature-parameter extraction section 62
extracts feature parameters based on the output of the acoustic analysis
section 61. The voice section detection section 63 detects the start and end
points of the voice in the input signal. A method using the power of the voice
is simple and common for detecting the voice section, but any method may be
used. The approach of word spotting, which performs the collating operation
assuming every point of the input to be an endpoint without performing voice
section detection, is described later. The multiple-frame buffer 64 forms the
input vector used for pattern matching (partial matching) by combining the
feature parameters of the frames in the neighborhood of each frame. The voice
piece standard-pattern storing section 65 stores the standard pattern of each
voice piece as a combination of partial patterns. The word dictionary 66
describes, for every word to be recognized, the connection information of its
voice pieces. The voice piece connection section 67 reads the voice piece
standard patterns stored in the voice piece standard-pattern storing section
65 and connects them according to this connection information. The partial
distance calculation section 68 calculates the distance (partial distance)
between a word standard pattern and the multiple-frame buffer. The distance
accumulation section 69 accumulates the partial distances for each word and
obtains the similarity to the word as a whole. The path judging section 610
chooses the path for which the accumulated distance is minimum. The judgment
section 611 finds and outputs the word that gives the minimum accumulated
distance.

[0005] Next, the case where the word-spotting method, which does not perform
voice section detection, is used is explained. The advantage of the
word-spotting method is that, since voice section detection, which is
generally vulnerable to noise, need not be used, a recognition system robust
against noise can be realized. Since voice section detection is not performed
in the word-spotting method, the collating operation is performed over a
sufficiently long section containing the voice. That is, unlike the case where
voice section detection is performed, it is meaningless to treat the collation
start point as the start of the voice and the collation end point as the end
of the voice. In the word-spotting method, the collating score against a word
standard pattern is computed by assuming, at every point of the whole input
section, that the point is an endpoint of the voice.
[0006] 

[Problem(s) to be Solved by the Invention] The approach explained in the
conventional example enables high-precision speech recognition for unspecified
speakers by actively using the information on the temporal movement between
neighboring frames and by using a statistical distance measure. Moreover,
since it connects voice pieces, a versatile recognition device whose
vocabulary can be changed merely by rewriting the word dictionary can be
realized. Furthermore, since precise voice section detection becomes
unnecessary by performing word spotting, a recognition device robust against
noise can be realized.

[0007] However, this approach uses as the feature parameter the analysis
parameters of a section with some width (multiple frames) that includes the
frames neighboring the frame concerned, so the number of dimensions of the
feature parameter is large. In addition, partial distances are found over the
whole input section and over the whole of each word standard pattern.
Therefore, even though a linear discriminant function is used to compute the
partial distance, the amount of computation is still large. Moreover, when
word spotting is used, there is the problem of "partial matching", in which a
word matches a part of another word and causes misrecognition, as in the
example of "Fujiidera" and "Fuji."
[0008] 

[Means for Solving the Problem] In order to solve the problems described
above, this invention is provided with: a voice-analysis means which analyzes
an input sound signal for every frame, the frame being the base unit of
analysis, and extracts an analysis parameter; a frame clock generation means
which generates the timing signal of the analysis frame; a dividing means
which divides the above-mentioned frame clock by a predetermined division
ratio to obtain a divided clock signal; a word dictionary in which each word
is written as a sequence of notations representing voice pieces; a voice piece
standard-pattern storing means which stores voice piece standard patterns,
each constituted by a sequence of partial standard patterns expressing parts
of voice piece data created from voice data uttered beforehand by many
speakers; a word standard-pattern generation means which obtains the standard
pattern of a word by connecting the above-mentioned voice piece standard
patterns according to the notation in the above-mentioned word dictionary; a
1st word standard-pattern processing means which creates a thinned-out word
standard pattern by thinning out a part of the feature-parameter sequence
constituting the above-mentioned word standard pattern; a 1st
collating/judgment means which, each time it receives the above-mentioned
divided clock signal, computes the partial distance between the feature
parameter obtained from the above-mentioned analysis parameter at that time
point and each partial standard pattern of the above-mentioned thinned-out
word standard pattern, accumulates it with the partial distances already
obtained between the thinned-out word standard pattern and the
feature-parameter sequence preceding that time point, thereby obtains the
minimum distance of the input to the thinned-out word standard pattern on the
assumption that the time point concerned is the end of a word, together with
the start-point position that accompanies it, and updates the pair of minimum
distance and start-point position for every word at every divided clock; a
candidate word selection means which, at the end of the input voice, compares
the distances to the standard patterns of all the words to be recognized with
one another and obtains a predetermined number of candidate words in
ascending order of distance; an endpoint positioning means which determines,
from the group of start-point and end-point candidates accompanying the
candidate words chosen by the above-mentioned candidate word selection means,
a section that is certain to include the voice section; a parameter storage
means which stores the above-mentioned analysis parameters over the whole
input section; an environmental pattern storing means which stores an
environmental standard pattern created beforehand from acoustic signals of
sections other than voice; a 2nd word standard-pattern processing means which
connects the above-mentioned environmental pattern before and after the
above-mentioned word standard pattern and creates a word standard pattern
with environmental patterns; and a 2nd collating/judgment means which, for
each word candidate chosen by the above-mentioned candidate word selection
means, computes by accumulating partial distances the distance between the
corresponding word standard pattern with environmental patterns and the
parameter sequence stored in the above-mentioned parameter storage means for
the section determined by the above-mentioned endpoint positioning means,
compares the distances obtained for the candidate words with one another, and
outputs the word candidate with the smallest distance as the recognition
result.
[0009] 

[Function] With the means described above, this invention first narrows down
the word candidates by a collating operation on data thinned out by the
dividing means and the 1st word standard-pattern processing means, and then
collates the narrowed-down candidates using data that are not thinned out. Its
first effect is thus to reduce the amount of computation while maintaining
recognition performance. In addition, by using the above-mentioned word
standard pattern with environmental patterns, collating includes the sections
outside the voice section, which realizes recognition robust against noise.
Its second effect is to solve the partial-matching problem that arises in word
spotting, in which a word collates with one section of another word and causes
misrecognition.
[0010] 

[Example] Hereafter, the 1st example of this invention is explained using the
drawings. Drawing 1 shows the configuration of the 1st example of this
invention. In drawing 1, 1 is the acoustic analysis section, 2 the frame clock
signal generating section, 3 the dividing section, 4 the 1st
collating/judgment section, 5 the word dictionary, 6 the voice piece
standard-pattern storing section, 7 the word standard-pattern generation
section, 8 the 1st word standard-pattern processing section, 9 the candidate
selection section, 10 the parameter storage section, 11 the endpoint fixing
section, 12 the 2nd collating/judgment section, 13 the 2nd word
standard-pattern processing section, and 14 the environmental
standard-pattern storing section. Next, the operation is explained.
[0011] The acoustic analysis section 1 performs A/D conversion of the input
signal and analyzes it at fixed time intervals (called frames; every 10 ms in
this example). Linear predictive coding analysis (LPC analysis) is used in the
example. The frame timing is given by the clock signal generated by the frame
clock signal generating section 2, and this clock signal is supplied to the
acoustic analysis section 1 and the dividing section 3. The dividing section 3
divides the frame clock signal by a predetermined division ratio (2 in this
example) and outputs a divided clock signal. This divided clock signal is
supplied to the 1st collating/judgment section 4 and is used for frame
thinning. The 1st collating/judgment section 4 collates, by word spotting, the
analysis parameters output by the acoustic analysis section 1 against the word
standard patterns generated by the processing described below. The details of
this processing are described later.
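The effect of dividing the frame clock can be illustrated with a small sketch. It assumes frames are held in a Python list and that "division by 2" simply means keeping every second frame tick; the function names `divide_clock` and `thin_frames` are hypothetical and do not come from the patent.

```python
def divide_clock(frame_ticks, ratio=2):
    """Return the ticks that survive division of the frame clock by `ratio`."""
    if ratio < 1:
        raise ValueError("ratio must be a positive integer")
    return [t for t in frame_ticks if t % ratio == 0]


def thin_frames(frames, ratio=2):
    """Keep only the frames that coincide with a divided clock tick."""
    return [frames[t] for t in divide_clock(range(len(frames)), ratio)]


# Seven 10 ms frames, each a toy 2-dimensional parameter vector (assumed data).
frames = [[float(t), 0.5 * t] for t in range(7)]
thinned = thin_frames(frames, ratio=2)
print(len(frames), "->", len(thinned))
```

With a division ratio of 2, the first collating stage in this sketch sees roughly half as many frames as the analysis section produces.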
[0012] Next, the method of generating the word standard patterns used in the
1st collating/judgment section 4 is explained. The recognition vocabulary is
expressed as sequences of voice piece notations, such as CV and VC, and is
stored in the word dictionary 5. The word standard-pattern generation section
7 generates a word standard pattern by connecting the voice piece standard
patterns stored in the voice piece standard-pattern storing section 6
according to the sequence of voice piece notations obtained by referring to
the word dictionary 5.
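As a rough illustration of this generation step, the sketch below looks up a word in a dictionary of voice piece notations and concatenates the stored partial patterns. The dictionary entry, the pattern labels, and the function name `build_word_pattern` are illustrative assumptions, not data from the patent.

```python
def build_word_pattern(word, word_dictionary, piece_patterns):
    """Concatenate the partial patterns of each voice piece listed for `word`."""
    pattern = []
    for piece in word_dictionary[word]:
        pattern.extend(piece_patterns[piece])
    return pattern


# Hypothetical dictionary entry: a word written as a sequence of CV voice pieces.
word_dictionary = {"sakura": ["sa", "ku", "ra"]}
# Hypothetical stored patterns; strings stand in for the partial patterns.
piece_patterns = {
    "sa": ["sa_1", "sa_2"],
    "ku": ["ku_1"],
    "ra": ["ra_1", "ra_2"],
}

word_pattern = build_word_pattern("sakura", word_dictionary, piece_patterns)
print(word_pattern)
```

Because the word pattern is assembled from the dictionary at run time, changing the vocabulary in this sketch only requires editing `word_dictionary`.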

[0013] The method of creating the voice piece standard patterns is described
later. Collating in the 1st collating/judgment section 4 is performed on
frame-thinned data in order to reduce the amount of computation. The word
standard pattern used in the 1st collating/judgment section 4 must therefore
also be frame-thinned. The 1st word standard-pattern processing section 8
applies this frame thinning to the word standard pattern obtained by the word
standard-pattern generation section 7. The candidate selection section 9
selects, from the collating results of all the words obtained by the 1st
collating/judgment section 4, a predetermined number of word candidates in
order from the best collating result. The parameter storage section 10 stores
the analysis parameters obtained by the acoustic analysis section 1 for the
whole input section.
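The candidate selection can be sketched as a simple ranking. The sketch assumes the first collating stage yields, for each word, a tuple of (minimum distance, start frame, end frame); this data layout, the example values, and the name `select_candidates` are assumptions made for illustration.

```python
def select_candidates(results, n):
    """Return the n words with the smallest distance, best first.

    `results` maps word -> (distance, start_frame, end_frame).
    """
    ranked = sorted(results.items(), key=lambda item: item[1][0])
    return ranked[:n]


# Hypothetical first-stage results.
results = {
    "fuji": (12.5, 10, 40),
    "fujiidera": (9.8, 8, 75),
    "kyoto": (30.1, 5, 60),
}
candidates = select_candidates(results, 2)
print([word for word, _ in candidates])
```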

[0014] The endpoint fixing section 11 integrates the start-point and end-point
information obtained along with each of the above-mentioned word candidates
and determines the collating section used in the 2nd collating/judgment
section 12. This collating section is determined so that it is certain to
include the voice section; a section longer than the actual voice section is
therefore obtained. For example, the earliest of the start points obtained
with the word candidates, or a position still earlier, is taken as the start
of the collating section. The same applies to the end: the latest of the end
points obtained with the word candidates, or a position still later, is taken
as the end of the collating section.
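One way to read this rule is sketched below: take the earliest candidate start and the latest candidate end, then widen both by a safety margin. The `margin` parameter, the clamping to the recorded input, and the name `fix_collating_section` are assumptions; the text only requires that the section be certain to contain the voice.

```python
def fix_collating_section(candidates, margin, n_frames):
    """Return (start, end) of a section covering every candidate's endpoints.

    `candidates` is a list of (word, (distance, start_frame, end_frame)).
    `margin` is a hypothetical number of extra frames added on each side.
    """
    starts = [start for _, (_, start, _) in candidates]
    ends = [end for _, (_, _, end) in candidates]
    section_start = max(0, min(starts) - margin)
    section_end = min(n_frames - 1, max(ends) + margin)
    return section_start, section_end


candidates = [("fujiidera", (9.8, 8, 75)), ("fuji", (12.5, 10, 40))]
print(fix_collating_section(candidates, margin=5, n_frames=100))
```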

[0015] For the above-mentioned word candidates and collating section, the 2nd
collating/judgment section 12 performs fixed-endpoint collating, without
thinning, between the parameters stored in the parameter storage section 10
and the word standard patterns obtained by the processing described below, and
outputs the word candidate that gives the best collating result as the
recognition result. Because, as described above, the collating section given
to the 2nd collating/judgment section 12 is longer than the actual voice
section, the word standard pattern used for this collating is one that the 2nd
word standard-pattern processing section 13 has processed by connecting the
environmental standard pattern stored in the environmental standard-pattern
storing section 14 to both ends of the word standard pattern obtained by the
word standard-pattern generation section 7. This environmental standard
pattern is created beforehand, for example, from the pattern of the noise
signal of the environment in which the recognition device is used.
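The processing of the 2nd word standard-pattern processing section 13 amounts to padding the word pattern with environmental patterns on both sides. A minimal sketch follows, assuming patterns are lists of partial patterns; the repetition count `n_env` and the name `add_environment` are hypothetical additions for illustration.

```python
def add_environment(word_pattern, env_pattern, n_env=1):
    """Connect the environmental pattern before and after the word pattern.

    `n_env` (an assumption) repeats the environmental pattern on each side.
    """
    pad = list(env_pattern) * n_env
    return pad + list(word_pattern) + pad


padded = add_environment(["w_1", "w_2", "w_3"], ["env"])
print(padded)
```

The padded pattern can then absorb the non-voice frames at both ends of the longer collating section, rather than forcing them to match the word itself.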
[0016] Next, the processing performed in the 1st collating/judgment section 4
and the 2nd collating/judgment section 12 is explained in detail. The two
differ in two points. The former performs the collating operation on
frame-thinned data, whereas the latter performs it on data that are not
thinned out. Also, the former performs word spotting by endpoint-free
collating, in which no collating section is given, whereas the latter performs
fixed-endpoint collating, in which the collating section is given beforehand.
In other respects, such as the feature parameters and the distance measure
used for collating, the basic approach is the same, so the details of the
collating processing are described for the 1st collating/judgment section 4,
and the differences are explained as they arise.

[0017] Drawing 2 is a detailed block diagram showing the flow of processing of
the 1st collating/judgment section 4, and drawing 3 is a block diagram of the
2nd collating/judgment section 12. The explanation is given mainly using
drawing 2, and drawing 3 is referred to as needed. Components with the same
name in drawing 2 and drawing 3 have the same function. In drawing 2 and
drawing 3, 21 is the multiple-frame buffer, 22 the partial distance
calculation section, 23 the distance accumulation section, 24 the path judging
section, and 25 the judgment section.

[0018] In drawing 2, the multiple-frame buffer 21 forms the input vector used
for pattern matching (partial matching) by combining the feature parameters of
the frames in the neighborhood of the i-th frame. The input vector $X_i$ in
the i-th frame is expressed as follows.

[0021]

[Equation 1]

$$X_i = (x_{i-L_1},\ x_{i-L_1+m},\ \ldots,\ x_i,\ \ldots,\ x_{i+L_2})$$

[0022] This is the vector that combines the feature parameters of frames
$i-L_1$ to $i+L_2$, taken every m frames. With L1 = L2 = 3 and m = 2, the
number of dimensions of $X_i$ is (p+2) × {(L1+L2+1)/m+1} = 12 × 4 = 48. Taking
m to be 2 or more is equivalent to forming the input vector from thinned-out
frames. The voice piece standard-pattern storing section 6 stores the standard
pattern of each voice piece as a combination of partial patterns. The method
of creating the voice piece standard patterns is explained here in some
detail.
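The formation of the multiple-frame input vector of Equation 1 can be sketched as follows. The sketch assumes each frame is a list of 12 parameters and that frame indices falling outside the input are clamped to the nearest valid frame; the clamping and the name `input_vector` are assumptions not stated in the text.

```python
def input_vector(frames, i, l1=3, l2=3, m=2):
    """Stack the frames i-l1, i-l1+m, ... up to i+l2 into one flat vector."""
    vector = []
    for offset in range(-l1, l2 + 1, m):
        # Clamp at the edges of the input (an assumption for illustration).
        index = min(max(i + offset, 0), len(frames) - 1)
        vector.extend(frames[index])
    return vector


# Twenty toy frames of 12 parameters each; frame t holds the value t.
frames = [[float(t)] * 12 for t in range(20)]
vec = input_vector(frames, i=10, l1=3, l2=3, m=2)
print(len(vec))
```

With L1 = L2 = 3 and m = 2, the offsets are -3, -1, +1 and +3, so four 12-dimensional frames are stacked, which agrees with the 48 dimensions given in the text.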

[0023] [Voice piece standard-pattern creation method] A voice piece is the
acoustic element used as the base unit of speech recognition. Its types
include the phoneme, the syllable (CV), the demisyllable (VC, CV), and the
vowel-consonant-vowel chain (VCV). Here C means a consonant and V means a
vowel. The following explanation takes as an example the case where the
syllable (CV) is used as the type of voice piece.

[0024] For example, the standard pattern of the voice piece /sa/ is created by
the following procedure.

(1) Cut out the parts where /sa/ is uttered from voice data uttered by many
speakers (suppose that 100 samples are cut out).

(2) Examine the duration distribution of the 100 samples of /sa/ and find the
mean duration JS of the 100 samples.

(3) Find, among the 100 samples, a sample whose duration is JS. When there are
two or more such samples, average them frame by frame. The representative
sample obtained in this way is written as

[0027]
[Equation 2]

$$S_j = (s_{j-L_1},\ s_{j-L_1+m},\ \ldots,\ s_j,\ \ldots,\ s_{j+L_2}) \quad (j = 1, \ldots, JS)$$

[0028] Here $s_j$ is the parameter vector of one frame; like the analysis
parameter, it consists of 11 LPC cepstrum coefficients and the difference
power.

(4) Perform pattern matching between each of the 100 samples (Equation 1) and
the representative sample (Equation 2), and find the frame correspondence
between the frames of the representative sample and those of each of the 100
samples (the most similar frames are matched to each other). This frame
correspondence can be found efficiently by dynamic programming.

(5) From each of the 100 samples, cut out the partial vector of the form of
(Equation 1) corresponding to each frame (j = 1 to JS) of the representative
sample. For simplicity, L1 = L2 = 3 and m = 1.

[0029] The partial vector of the n-th sample, among the data of the 100
samples, that corresponds to the j-th frame of the representative sample is
written as

[0030]
[Equation 3]

$$X_j^n = (x^n_{j-L_1},\ x^n_{j-L_1+m},\ \ldots,\ x^n_j,\ \ldots,\ x^n_{j+L_2})$$

[0031] Here the subscript j denotes the frame, in the n-th sample of the same
voice piece /sa/, that corresponds to the j-th frame of the representative
sample. In this example it is a 48-dimensional vector (n = 1 to 100).

(6) From the 100 vectors $X_j^n$, find the mean vector $\mu_j$ (48 dimensions)
and the covariance matrix $W_j$ (48 × 48 dimensions) for j = 1 to JS. There
are as many means and covariance matrices as the standard frame length JS (it
is not, however, necessary to create them for all frames; they may be created
for thinned-out frames).

[0038] For voice pieces other than /sa/, $\mu_j$ and $W_j$ are found by the
same procedure (1) to (6) above.

[0040] Over all the voice sections of all the sample data, the moving average
$\mu_x$ (48 dimensions) and the moving covariance matrix $W_x$ (48 × 48
dimensions) are found. These are called the perimeter pattern.
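Step (4) of the procedure above states that the frame correspondence between a sample and the representative sample can be found by dynamic programming. The following is a minimal sketch of one such alignment, assuming a squared Euclidean frame distance and the three usual step directions; the patent does not specify either choice, and the names `frame_distance` and `dp_align` are hypothetical.

```python
def frame_distance(a, b):
    """Squared Euclidean distance between two frame vectors (an assumption)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dp_align(sample, representative):
    """For each representative frame, return the index of the matched sample frame."""
    n, m = len(sample), len(representative)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = frame_distance(sample[i], representative[j])
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            best, prev = inf, None
            for pi, pj in ((i - 1, j - 1), (i - 1, j), (i, j - 1)):
                if pi >= 0 and pj >= 0 and cost[pi][pj] < best:
                    best, prev = cost[pi][pj], (pi, pj)
            cost[i][j] = d + best
            back[i][j] = prev
    # Trace the best path back, keeping the closest sample frame per column.
    match = {}
    i, j = n - 1, m - 1
    while True:
        d = frame_distance(sample[i], representative[j])
        if j not in match or d < frame_distance(sample[match[j]], representative[j]):
            match[j] = i
        if back[i][j] is None:
            break
        i, j = back[i][j]
    return [match[j] for j in range(m)]


sample = [[0.0], [0.0], [1.0], [2.0], [2.0]]
representative = [[0.0], [1.0], [2.0]]
alignment = dp_align(sample, representative)
print(alignment)
```

The returned indices play the role of the corresponding frames from which the partial vectors of step (5) would be cut out.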
Next, the standard pattern is created using the means and covariances.
a. The covariance matrices are pooled into a common matrix W. [0044]
[Equation 4]

$$W = \Bigl(\sum_{h}\sum_{j} W_{h,j} + g\,W_x\Bigr) \Big/ \Bigl(\sum_{h} J_h + g\Bigr)$$

[0045] Here h indexes the type of voice piece, j indexes the partial patterns
of voice piece h, and $J_h$ is their number; in the case of CV there are about
130 types. g is the ratio at which the perimeter pattern is mixed in, and
usually g = 1.

[0046] b. The partial patterns $A_{h,j}$ and $B_{h,j}$ of each voice piece are
created.
[0049]
[Equation 5]

$$A_{h,j} = 2\,(W^{-1}\mu_{h,j} - W^{-1}\mu_x)$$

[0050]
[Equation 6]

$$B_{h,j} = \mu_{h,j}^{t}\,W^{-1}\mu_{h,j} - \mu_x^{t}\,W^{-1}\mu_x$$

[0051] The derivation of these formulas is described later. An example of the
voice piece standard-pattern creation method is shown in drawing 4. Between
the start and end points of each training sample, the frame correspondence
with the representative sample is found, and the voice piece sample is divided
into JS parts. In drawing 4, the frame corresponding to frame j of the
representative sample is denoted (j). For each j = 1 to JS, the mean and
covariance are calculated using the data of the 100 samples over the section
(j)-L1 to (j)+L2, and the partial pattern $A_{h,j}$, $B_{h,j}$ is obtained.

[0053] The standard pattern of voice piece h is therefore the $J_h$ partial
patterns connected together, with sections that overlap one another. The
perimeter pattern is obtained by finding the mean and covariance while
shifting a partial section of L1+L2+1 frames one frame at a time, as shown in
the drawing. The range used for creating the perimeter pattern may include not
only the voice section but also the noise sections before and after it. The
voice piece standard patterns obtained in this way are stored beforehand in
the voice piece standard-pattern storing section 6.
[0054] [Voice piece connection] The word dictionary 5 describes, for every
word to be recognized, the connection information of its voice pieces; an
example is shown in drawing 5. The word standard-pattern generation section 7
reads the voice piece standard patterns stored in the voice piece
standard-pattern storing section 6 and connects them according to this
connection information. This connection forms a pseudo standard pattern of the
word (hereafter called a "word standard pattern"), as shown in the example of
drawing 6. The word standard pattern of word k created in this way is
expressed as

[0055]
[Equation 7]

$$A_{k,j} = 2\,(W^{-1}\mu_{k,j} - W^{-1}\mu_x)$$

[0056]
[Equation 8]

$$B_{k,j} = \mu_{k,j}^{t}\,W^{-1}\mu_{k,j} - \mu_x^{t}\,W^{-1}\mu_x$$

[0057] In the case of drawing 2, as described above, the data that have been
frame-thinned by the 1st word standard-pattern processing section 8 are used
as the word standard pattern. In the case of drawing 3, frame thinning is not
performed; instead, the standard pattern obtained in the 2nd word
standard-pattern processing section 13 by adding the environmental standard
pattern stored in the environmental standard-pattern storing section 14 to
both ends of the word standard pattern is used.

[0058] [Calculation of partial distance] The distance (partial distance) between the word standard pattern formed as described above and the multiple-frame buffer is calculated in the partial distance count section 22. In the case of drawing 2, since collation is performed on frame-thinned data, the suffixes i and j that denote frame numbers in the following explanation are renumbered over the frames that remain after thinning.

[0059] The partial distance between the input vector, which includes the information of multiple frames as shown by (Equation 1), and the partial pattern of each word is calculated using a statistical distance measure. Since the distance for the whole word is obtained by accumulating the distances to the partial patterns (partial distances), the partial distance must be calculated in such a way that distance values can be compared with one another regardless of the position in the input or the difference between partial patterns. For that purpose, it is necessary to use a distance measure based on a-posteriori probability. Namely, the distance between the input of (Equation 1) and the j-th partial pattern of the word "k", [0060]
[External Character 11]

$\omega_{k,j}$

[0061] is calculated from the a-posteriori probability [0062]
[External Character 12]

$P(\omega_{k,j} \mid X_i)$

[0063] By Bayes' theorem, this becomes the following equation.

[0064]

[Equation 9]

$$P(\omega_{k,j} \mid X_i) = P(\omega_{k,j}) \cdot P(X_i \mid \omega_{k,j}) \,/\, P(X_i)$$

[0065] The 1st term of the right-hand side is treated as a constant, on the assumption that every word is equally likely to appear. The a-priori probability of the 2nd term of the right-hand side is given by the following equation, on the assumption that the parameters are normally distributed.
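[0066]

[Equation 10]

$$P(X_i \mid \omega_{k,j}) = (2\pi)^{-d/2}\, |W_{k,j}|^{-1/2} \exp\left\{ -\tfrac{1}{2} (X_i - \mu_{k,j})\, W_{k,j}^{-1}\, (X_i - \mu_{k,j})^{t} \right\}$$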



[0067] $P(X_i)$ in (Equation 9) is the sum of the probabilities over all input conditions that may occur, including the words and their surrounding information. When the parameter is an LPC cepstrum coefficient or a band pass filter output, its distribution can be considered close to a normal distribution. Here, $P(X_i)$ is assumed to follow a normal distribution whose mean and covariance are, respectively, [0068]
[External Character 13]

$\mu_x,\ W_x$

[0069] which gives the following equation.
[0070]

[Equation 11]

$$P(X_i) = (2\pi)^{-d/2}\, |W_x|^{-1/2} \exp\left\{ -\tfrac{1}{2} (X_i - \mu_x)\, W_x^{-1}\, (X_i - \mu_x)^{t} \right\}$$



[0071] Substituting (Equation 10) and (Equation 11) into (Equation 9), taking the logarithm, omitting the constant terms, and multiplying by -2 gives the following equation.
[0072]

[Equation 12]

$$L_k(i,j) = (X_i - \mu_{k,j})\, W_{k,j}^{-1}\, (X_i - \mu_{k,j})^{t} - (X_i - \mu_x)\, W_x^{-1}\, (X_i - \mu_x)^{t} + \log\left( |W_{k,j}| \,/\, |W_x| \right)$$

[0073] This formula is the Bayes distance converted to an a-posteriori probability basis; its discriminating ability is high, but it has the drawback of a large amount of computation. It is therefore expanded into a linear discriminant as follows. Assume that the covariance matrices of all partial patterns of all words, and also of the perimeter pattern, are equal. Under this assumption the covariance matrix is made common by (Equation 4), and substituting it into (Equation 12) and rearranging yields the following simple linear discriminant.
[0074]

[Equation 13]

$$L_k(i,j) = B_{k,j} - A_{k,j} \cdot X_i^{t}$$

$$B_{k,j} = \mu_{k,j} W^{-1} \mu_{k,j}^{t} - \mu_x W^{-1} \mu_x^{t}$$

[0075]

[External Character 14]

$A_{k,j},\ B_{k,j}$

[0076] are as already shown in (Equation 7) and (Equation 8), and the j-th standard pattern of the word "k" is expressed by this pair.
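The reduction from the quadratic form to the linear discriminant can be checked numerically. The sketch below assumes row-vector inputs, a symmetric inverse covariance `w_inv` shared by every pattern, and toy values chosen only for illustration; the function names are hypothetical. With a common covariance the log-determinant term is zero and the $X W^{-1} X^{t}$ terms cancel, leaving $B - A \cdot X^{t}$.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mat_vec(m: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in m]


def quad(v: Vector, w_inv: Matrix) -> float:
    """Quadratic form v W^{-1} v^t."""
    return dot(v, mat_vec(w_inv, v))


def make_pattern(mu: Vector, mu_x: Vector, w_inv: Matrix) -> Tuple[Vector, float]:
    """A = 2 (mu - mu_x) W^{-1} and B = mu W^{-1} mu^t - mu_x W^{-1} mu_x^t."""
    diff = [m - mx for m, mx in zip(mu, mu_x)]
    a = [2.0 * c for c in mat_vec(w_inv, diff)]  # valid because W^{-1} is symmetric
    b = quad(mu, w_inv) - quad(mu_x, w_inv)
    return a, b


def linear_distance(a: Vector, b: float, x: Vector) -> float:
    """Linear discriminant: B - A . X^t."""
    return b - dot(a, x)


def quadratic_distance(x: Vector, mu: Vector, mu_x: Vector, w_inv: Matrix) -> float:
    """Quadratic form with a common covariance (the log term vanishes)."""
    dk = [xi - m for xi, m in zip(x, mu)]
    dx = [xi - m for xi, m in zip(x, mu_x)]
    return quad(dk, w_inv) - quad(dx, w_inv)


w_inv = [[2.0, 0.5], [0.5, 1.0]]   # assumed shared inverse covariance
mu, mu_x = [1.0, 2.0], [0.0, 1.0]  # toy pattern mean and perimeter mean
a, b = make_pattern(mu, mu_x, w_inv)
x = [3.0, -1.0]
lin = linear_distance(a, b, x)
quadr = quadratic_distance(x, mu, mu_x, w_inv)
```

Because `a` and `b` depend only on the stored statistics, they can be computed once per partial pattern; each input frame then costs a single inner product.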

[0077] The distance accumulation section 23 accumulates the partial distances over the section j = 1 to Jk for each word and obtains the similarity to the whole word. In doing so, the accumulation must be carried out while expanding and contracting the input part (I frames) to fit the allowed time length Jk of each word. This calculation can be performed efficiently using dynamic programming (the DP method).
[0078] In drawing 2, collation uses the word-spotting method, that is, endpoint-free collation without voice section detection, so word collation is processed as follows. Since voice section detection is not performed in the word-spotting method, the collating operation is performed over a sufficiently long section that contains the voice. That is, unlike the case where voice section detection is performed, it is meaningless to treat i = 1, the point at which collation starts, as the start of the voice and i = I as the end of the voice. In the word-spotting method, the collating score for a word standard pattern is computed by always assuming that every point in the input section may be an endpoint of the voice. The accumulation of partial similarity performed in the path judging section 24 is therefore as follows. Here, the subscript k of the word number is omitted, the partial distance between the i-th frame part of the input and the j-th partial pattern is expressed as L(i, j), and the accumulation distance up to the frame (i, j) is expressed as g(i, j). The path judging section 24 performs the following calculation [0079].
[Equation 14]

$$g(i,1) = L(i,1)$$

$$g(i,j) = \min \begin{cases} g(i-2,\,j-1) + L(i,j) \\ g(i-1,\,j-1) + L(i,j) \\ g(i-1,\,j-2) + L(i,\,j-1) + L(i,j) \end{cases}$$

$$(1 \le i \le I,\ 1 < j \le J)$$

[0080] With this calculation, the path that gives the minimum accumulation distance among the three paths shown in the formula is chosen. Distances are accumulated successively in this way; in the judgment section 25, g(i, J) is used as the final collating score of the word standard pattern, and the i at which g(i, J) takes its smallest value is taken as the end of the voice. The start of the voice can be obtained by tracing back the path judged by the path judging section 24.
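The endpoint-free accumulation can be sketched as below. This is an illustrative reading of the three-path recurrence with 0-based indices, not the circuit of the specification. The function name `word_spotting_dp` is hypothetical, and instead of tracing the path backwards the sketch carries the start frame forward along each path, which yields the same start point.

```python
import math
from typing import List, Tuple


def word_spotting_dp(L: List[List[float]]) -> Tuple[float, int, int]:
    """L[i][j]: partial distance of input frame i against partial pattern j.

    Returns (best score, end frame, start frame), all 0-based.
    """
    I, J = len(L), len(L[0])
    g = [[math.inf] * J for _ in range(I)]
    start = [[-1] * J for _ in range(I)]
    for i in range(I):
        g[i][0] = L[i][0]      # every frame may be the start of the word
        start[i][0] = i
    for i in range(I):
        for j in range(1, J):
            candidates = []
            if i >= 2:
                candidates.append((g[i - 2][j - 1] + L[i][j], start[i - 2][j - 1]))
            if i >= 1:
                candidates.append((g[i - 1][j - 1] + L[i][j], start[i - 1][j - 1]))
            if i >= 1 and j >= 2:
                candidates.append(
                    (g[i - 1][j - 2] + L[i][j - 1] + L[i][j], start[i - 1][j - 2]))
            if candidates:
                g[i][j], start[i][j] = min(candidates)
    # Every frame is also a candidate end point: pick the smallest g(i, J).
    end = min(range(I), key=lambda i: g[i][J - 1])
    return g[end][J - 1], end, start[end][J - 1]


# Toy distances: the word "fits" frames 1..3 of a 5-frame input.
L = [
    [1.0, 1.0, 1.0],
    [-1.0, 1.0, 1.0],
    [1.0, -1.0, 1.0],
    [1.0, 1.0, -1.0],
    [1.0, 1.0, 1.0],
]
score, end_frame, start_frame = word_spotting_dp(L)
```

In this toy input the lowest score is found at end frame 3 with start frame 1, so the word is located without any prior voice section detection.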

[0081] Since the operation performed in the path judging section 34 of drawing 3 is endpoint-fixed processing, it is as follows.

[0082]

[Equation 15]

$$g(1,1) = L(1,1)$$

$$g(i,j) = \min \begin{cases} g(i-2,\,j-1) + L(i,j) \\ g(i-1,\,j-1) + L(i,j) \\ g(i-1,\,j-2) + L(i,\,j-1) + L(i,j) \end{cases}$$

$$(1 < i \le I,\ 1 < j \le J)$$



[0083] In (Equation 15), for convenience, the voice frames i are renumbered so that the start of the collating section is 1 and its end is I.

[0084] The path judging section 34 chooses the path that gives the minimum accumulation distance among the three paths shown by (Equation 15). Distances are accumulated successively in this way, and the accumulation distance g(i, Jk) at the point where j = Jk and i = I, that is g(I, Jk), is taken as the collating score of the word "k". The judgment section 35 finds and outputs the word "k" that gives the minimum value of the accumulation distance g(I, Jk).
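The endpoint-fixed case can be sketched in the same style. The code below is an assumption-laden illustration with 0-based indices: only the lattice point (1, 1) is initialized, the score is read at (I, Jk), and the word with the smallest score wins. The names `fixed_endpoint_score` and `recognize`, and the toy distance tables, are hypothetical.

```python
import math
from typing import Dict, List


def fixed_endpoint_score(L: List[List[float]]) -> float:
    """Accumulated distance from lattice point (1, 1) to (I, J), 0-based here."""
    I, J = len(L), len(L[0])
    g = [[math.inf] * J for _ in range(I)]
    g[0][0] = L[0][0]          # the path must start at the first frame
    for i in range(1, I):
        for j in range(1, J):
            best = g[i - 1][j - 1] + L[i][j]
            if i >= 2:
                best = min(best, g[i - 2][j - 1] + L[i][j])
            if j >= 2:
                best = min(best, g[i - 1][j - 2] + L[i][j - 1] + L[i][j])
            g[i][j] = best
    return g[I - 1][J - 1]     # the path must end at the last frame


def recognize(distances: Dict[str, List[List[float]]]) -> str:
    """Return the word whose accumulated distance is smallest."""
    return min(distances, key=lambda word: fixed_endpoint_score(distances[word]))


# Toy partial-distance tables for two candidate words over a 3-frame section.
candidates = {
    "a": [[0.0, 9.0, 9.0], [9.0, 0.0, 9.0], [9.0, 9.0, 0.0]],
    "b": [[1.0, 9.0], [9.0, 9.0], [9.0, 1.0]],
}
best_word = recognize(candidates)
```

Only the narrowed-down candidates need to be scored this way, which is why the unthinned data can be used at this stage without a large computational cost.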

[0085] Hereafter, the 2nd example of this invention is explained. Drawing 7 shows the block diagram of the 2nd example of this invention. In drawing 7, the same numbers are given to the same components as in drawing 1. The points that differ from the 1st example are the 1st collating/judgment section 42 and the word standard-pattern merge section 41, and the processing performed by these is explained in detail next. First, the processing performed in the word standard-pattern merge section 41 is explained. To reduce the amount of computation in the partial distance calculation, two partial standard patterns are grouped and merged into one. Since a linear discriminant is used, adding the corresponding parameters beforehand and then finding the partial distance is equal to finding the sum of the partial distances along the DP path. This processing therefore fixes the DP path to a single path for every two frames. Of the partial standard patterns [0086]
[External Character 15]
$A_k,\ B_k$

[0087] the patterns [0088]
[External Character 16]

$A_{k,2n-1}$

[0089] and [0090]
[External Character 17]

$A_{k,2n}$

[0091] are shifted by one frame and merged to create the merged pattern. That is, when the conventional partial distances are written as (Equation 16) and (Equation 17), [0092]
[Equation 16]

$$L_k(i,\,2n-1) = B_{k,2n-1} - A_{k,2n-1} \cdot X_{i-1}^{t}$$
[0093]

[Equation 17]

$$L_k(i,\,2n) = B_{k,2n} - A_{k,2n} \cdot X_i^{t}$$



[0094] Combining the two formulas above, the partial distance becomes [0095]

[Equation 18]

$$L_k(i,n) = B_{k,n} - A_{k,n} \cdot X_i^{t}$$

[0096] which is equal to finding the conventional partial distance under the condition that the DP path is fixed to a single path for every two frames.

[0097] With this improvement, if the feature parameter used for the partial distance calculation spans L frames, the amount of computation can be reduced to (L+1)/2L. In the 2nd example, L = 4 is used as the parameter, so the amount of computation in this case is reduced to five eighths.
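The merge and the resulting saving can be illustrated with a small sketch. It assumes that the input vector X_i stacks L consecutive frames ending at frame i, each with d features; this layout, the helper names, and the toy numbers are assumptions made for illustration. Under that layout, the two inner products over 2L frame slots collapse into one inner product over L+1 slots.

```python
from fractions import Fraction
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def merge_patterns(a_odd: Vector, b_odd: float,
                   a_even: Vector, b_even: float, d: int) -> Tuple[Vector, float]:
    """Merge two linear patterns whose input windows are offset by one frame.

    a_odd weights frames i-L .. i-1, a_even weights frames i-L+1 .. i;
    the merged weights cover frames i-L .. i (L+1 frames of d features each).
    """
    merged_a = list(a_odd) + [0.0] * d
    for idx, value in enumerate(a_even):
        merged_a[idx + d] += value
    return merged_a, b_odd + b_even


def cost_ratio(num_frames: int) -> Fraction:
    """Multiplications after merging relative to before: (L+1) / 2L."""
    return Fraction(num_frames + 1, 2 * num_frames)


# Toy case: d = 1 feature per frame, L = 2 frames per input vector.
x_prev, x_curr = [1.0, 2.0], [2.0, 3.0]   # X_{i-1} and X_i share one frame
x_wide = [1.0, 2.0, 3.0]                  # the L+1 frames seen by the merged pattern
a_odd, b_odd = [1.0, 2.0], 0.5
a_even, b_even = [3.0, 4.0], 1.5

separate = (b_odd - dot(a_odd, x_prev)) + (b_even - dot(a_even, x_curr))
merged_a, merged_b = merge_patterns(a_odd, b_odd, a_even, b_even, d=1)
merged = merged_b - dot(merged_a, x_wide)
ratio_for_four_frames = cost_ratio(4)
```

The merged distance equals the sum of the two separate distances, and for L = 4 the ratio evaluates to 5/8, matching the figure given in the text.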

[0098] Next, the processing performed in the 1st collating/judgment section 42 is explained using a drawing. Drawing 8 is a block diagram showing the details of the flow of processing in the 1st collating/judgment section 42. The partial distance between the multiple-frame buffer and the word standard pattern obtained from the 1st word standard-pattern processing section 8 and the word standard-pattern merge section 41 is calculated in the partial distance count section 22. In the representation partial distance selection section 51, for the partial distances of every two input frames, the smaller one is selected beforehand as the representation partial distance [0099]
[External Character 18]
$\mathrm{typ}L_k(2i,\,n)$

[0100] which is given by (Equation 19).
[0101]

[Equation 19]

$$\mathrm{typ}L_k(2i,\,n) = \min\left( L_k(2i-1,\,n),\ L_k(2i,\,n) \right)$$

[0102] This representation partial distance is accumulated in the distance accumulation section 23 to obtain the similarity to the whole word. In doing so, the accumulation must be carried out while expanding and contracting the input part (I frames) to fit the allowed time length Jk of each word. As in the 1st example, this calculation can be performed efficiently using the DP method.

[0103] In drawing 8, collation uses the word-spotting method, that is, endpoint-free collation without voice section detection, so word collation is processed as follows. Since voice section detection is not performed in the word-spotting method, the collating operation is performed over a sufficiently long section that contains the voice. In the word-spotting method, the collating score for a word standard pattern is computed by always assuming that every point in the input section may be an endpoint of the voice. The accumulation of partial similarity performed in the path judging section 52 is therefore as follows. Here, the subscript k of the word number is omitted, the partial distance between the i-th frame part of the input and the j-th partial pattern is expressed as typL(i, j), and the accumulation distance up to the frame (i, j) is expressed as [0104]
[External Character 19]
$g(i,j)$

[0105] The path judging section 52 performs the following calculation [0106].
[Equation 20]

$$g(i,1) = \mathrm{typ}L(i,1)$$

$$g(i,j) = \min \begin{cases} g(i-1,\,j-1) + \mathrm{typ}L(i,j) \\ g(i-2,\,j-1) + \mathrm{typ}L(i,j) \\ g(i-3,\,j-1) + \mathrm{typ}L(i,j) \end{cases}$$

$$(1 \le i \le I,\ 1 < j \le J)$$



[0107] With this calculation, the path that gives the minimum accumulation distance among the three paths shown in the formula is chosen. Distances are accumulated successively in this way; in the judgment section 25, g(i, J) is used as the final collating score of the word standard pattern, and the i at which g(i, J) takes its smallest value is taken as the end of the voice. The start of the voice can be obtained by tracing back the path judged by the path judging section 52. Thereafter, the same processing as in the 1st example is performed.
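The selection of the representation partial distance and the accumulation that follows it can be sketched together. The code is a hedged illustration with 0-based indices, so input frames (2m, 2m+1) form one pair; the function names and the toy table are hypothetical, and the three predecessors are taken as (i-1, j-1), (i-2, j-1) and (i-3, j-1).

```python
import math
from typing import List, Tuple


def representative_distances(L: List[List[float]]) -> List[List[float]]:
    """For each pair of input frames keep the smaller partial distance."""
    pairs = len(L) // 2
    return [[min(a, b) for a, b in zip(L[2 * m], L[2 * m + 1])]
            for m in range(pairs)]


def spot_with_representatives(typ: List[List[float]]) -> Tuple[float, int]:
    """Endpoint-free accumulation over the representative distances.

    Returns (best score, end index) on the halved input axis, 0-based.
    """
    I, J = len(typ), len(typ[0])
    g = [[math.inf] * J for _ in range(I)]
    for i in range(I):
        g[i][0] = typ[i][0]    # every position may be the start of the word
    for i in range(I):
        for j in range(1, J):
            best_prev = min((g[i - s][j - 1] for s in (1, 2, 3) if i - s >= 0),
                            default=math.inf)
            g[i][j] = best_prev + typ[i][j]
    end = min(range(I), key=lambda i: g[i][J - 1])
    return g[end][J - 1], end


# Toy table: 6 input frames against 2 merged partial patterns.
L = [
    [5.0, 5.0], [1.0, 5.0],
    [5.0, 5.0], [5.0, 2.0],
    [5.0, 5.0], [5.0, 5.0],
]
typ = representative_distances(L)
score, end_index = spot_with_representatives(typ)
```

Because the input axis is halved by the pairing and the dictionary axis is halved by the merge, the lattice visited by the accumulation is about one quarter of the original size.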
[0108] 

[Effect of the Invention] As explained above, this invention first narrows down the word candidates by a collating operation using thinned-out data, and then collates the narrowed-down candidates using data that are not thinned out. This reduces the amount of computation while securing recognition performance. If implementation on hardware of the same scale is considered, the vocabulary size can be expanded by about one order of magnitude compared with the conventional example while maintaining recognition performance. Moreover, by using a word standard pattern with environmental patterns attached and collating over a section that includes the outside of the voice section, the partial-matching phenomenon that poses a problem in word spotting, in which a word is collated with one section of another word and misrecognition results, does not arise. Since precise detection of the voice section is not performed either, a voice recognition unit that is robust to noise can be realized.

[0109] Furthermore, in the 2nd example, merging the partial standard patterns over multiple frames reduces the number of partial product calculations and halves the number of DP lattice points along each of the dictionary axis and the input axis, so that the number of comparison operations is substantially reduced.